This notebook explores the data collected by ABC Bank in order to predict customer churn.
#Importing necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling
#importing data
churn_data=pd.read_csv("../raw data/Churn_data.csv")
#Preview of data
churn_data.head()
#Reviewing data types
churn_data.dtypes
The data has 13 independent features of various data types (int, float, object), and the Exited column is the dependent feature of interest.
The HasCrCard and IsActiveMember features are stored as int64, but they are categorical, so we should convert them to the category dtype.
The dependent feature is also stored as int64, but we may want to convert it to category so that this is treated as a classification problem.
The customer name feature is unlikely to be of much use and could be discarded later.
#Converting into correct formats
churn_data["Geography"]=churn_data["Geography"].astype("category")
churn_data["Gender"]=churn_data["Gender"].astype("category")
churn_data["HasCrCard"]=churn_data["HasCrCard"].astype("category")
churn_data["IsActiveMember"]=churn_data["IsActiveMember"].astype("category")
churn_data["Exited"]=churn_data["Exited"].astype("category")
churn_data.dtypes
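The conversions above can also be written as a single `astype` call that takes a column-to-dtype mapping. The following is a minimal sketch on a small toy frame: the values are invented for illustration and only the column names mirror the notebook.

```python
import pandas as pd

# Toy frame with invented values; only the column names mirror the notebook.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany"],
    "Gender": ["Female", "Male", "Female"],
    "HasCrCard": [1, 0, 1],
    "IsActiveMember": [1, 1, 0],
    "Exited": [0, 1, 0],
    "Age": [42, 41, 39],
})

categorical_cols = ["Geography", "Gender", "HasCrCard", "IsActiveMember", "Exited"]

# astype accepts a {column: dtype} mapping, so every listed column
# is converted in one call; unlisted columns (Age) keep their dtype.
toy = toy.astype({col: "category" for col in categorical_cols})

print(toy.dtypes)
```

Keeping the column names in one list makes it easier to reuse the same set later, for example when summarising only the categorical features.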
#Reviewing data info
churn_data.info()
The dataset contains 10,000 rows and no missing values, so we can continue with our data exploration.
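`info()` shows non-null counts per column, and the same conclusion can be checked explicitly by counting missing values. The sketch below uses a toy frame with one deliberately injected NaN (invented values, not the bank data) to show what the check reports. On the real data, the equivalent call would be `churn_data.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Toy frame with one injected NaN; values are illustrative only.
toy = pd.DataFrame({
    "CreditScore": [619, 608, np.nan],
    "Balance": [0.0, 83807.86, 159660.80],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()

# Single flag: True if any cell in the frame is missing.
has_missing = bool(toy.isna().any().any())

print(missing_per_column)
print("Any missing values:", has_missing)
```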
#Reviewing data range for numerical variables
churn_data.describe()
#Reviewing categorical variables
churn_data.describe(include=['category'])
#Get categories of independent variables
print(churn_data.Geography.value_counts())
print(churn_data.Gender.value_counts())
print(churn_data.HasCrCard.value_counts())
print(churn_data.IsActiveMember.value_counts())
#Get categories of dependent variable
print(churn_data.Exited.value_counts())
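Raw counts of the target are easier to judge for class balance when expressed as proportions. As a minimal sketch, `value_counts(normalize=True)` returns the share of each class. The toy series below is invented and does not reflect the actual churn rate in the bank data.

```python
import pandas as pd

# Toy target with invented labels: 0 = stayed, 1 = exited.
exited = pd.Series([0, 0, 0, 1], name="Exited")

# normalize=True returns fractions that sum to 1 instead of raw counts.
class_share = exited.value_counts(normalize=True)

print(class_share)
```

On the notebook's data, the same call would be `churn_data.Exited.value_counts(normalize=True)`.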
profile_report = churn_data.profile_report(explorative=False, html={'style': {'full_width': True}})
profile_report